Over the past few weeks, I’ve been diving headfirst into the world of Natural Language Processing (NLP), and honestly it’s been kind of mind-blowing. Learning the tools and exploring the endless use cases has completely opened my eyes to just how powerful NLP can be. So much of the world’s data is text, and being able to actually make sense of it and pull meaningful insights? Total game changer.
One area I’ve been particularly obsessed with is sentiment analysis, figuring out whether a piece of text sounds positive, neutral, or negative. I started off with the basics, experimenting with built-in datasets from the NLTK library and playing around with the VADER lexicon. But the real magic started happening when I got my hands on pretrained models from Hugging Face, models that are trained on specific types of text like tweets.
Now here’s where things took a turn. My wife recently got into Love Island USA, a show where a bunch of attractive 20-somethings date each other in a villa in Fiji. I originally watched to be a supportive husband… but I’ll admit it, I got hooked. The drama, the storylines, the unexpected twists, I was all in. And somewhere along the way, after watching my opinion of a contestant totally shift over time, it hit me: what if I could track public sentiment like that?
That spark turned into this project—analyzing Reddit discussions of Love Island USA to track how the internet feels about each contestant as the season unfolds.
Grabbing Data
Grabbing data for this was simple at first, but got more complex as the amount of data I needed to grab got bigger. There is a strong community following for the show under the subreddit r/loveislandusa. Within this subreddit, there are post-episode discussion threads for each episode. Each episode tends to get around 8k-15k comments, which is plenty for me to run my analysis.
After signing up for a Reddit API key, I could set up my API caall and grab the data I needed. However, this took some trial an error as I kept running into 429 Response errors, meaning I was asking reddit for too much data to quick. Luckily, after adding some *time.sleep* to slow down the requests, I was able to grab all the data I needed.
The following is a snippet of the data the api call grabs:
Show the code
import pandas as pdli_full = pd.read_csv('li_full.csv')li_full[['comment','score','created_utc','episode_title']].head()
comment
score
created_utc
episode_title
0
you vote for ace and iris to couple up and he’...
746
1.749867e+09
Season 7 - Episode 10 - Post Episode Discussion
1
you vote for ace and iris to couple up and he’...
746
1.749867e+09
Season 7 - Episode 10 - Post Episode Discussion
2
you vote for ace and iris to couple up and he’...
746
1.749867e+09
Season 7 - Episode 10 - Post Episode Discussion
3
Huda is so insanely childish, she 100% gave Je...
674
1.749867e+09
Season 7 - Episode 10 - Post Episode Discussion
4
Huda is so insanely childish, she 100% gave Je...
674
1.749867e+09
Season 7 - Episode 10 - Post Episode Discussion
Data Storage and Automation
Since the season is still airing with new episodes almost every day, I needed a way to keep the data up to date with the latest comments. To handle this, I basically reused my original Reddit API setup but added logic to skip any episodes I had already pulled. That way, it only grabs new ones as they come out.
Along the way, I also discovered how efficient parquet files are compared to CSVs. They store the same data in a much more compact format, which is perfect for saving space without losing structure.
With those pieces in place, it was easy to schedule a GitHub workflow to run daily at 4pm MST. That gives fans enough time to react to the previous night’s episode before a new one airs a few hours later.
API Call
def update_with_new_episodes(reddit, output_folder="data/season7_comments", max_retries=3): os.makedirs(output_folder, exist_ok=True)# Get a set of already-downloaded post IDs from filenames existing_files = { re.search(r'_(\w+)_comments\.parquet$', f).group(1)for f in glob.glob(f"{output_folder}/*_comments.parquet")if re.search(r'_(\w+)_comments\.parquet$', f) }# Step 1: Search for new Season 7 posts new_posts = []for post in reddit.subreddit("LoveIslandUSA").search("Season 7 Episode", sort="new", limit=500):if"Post Episode Discussion"in post.title and post.idnotin existing_files: new_posts.append({'post_id': post.id,'title': post.title,'created_utc': post.created_utc,'score': post.score,'num_comments': post.num_comments })ifnot new_posts:print("✅ No new episodes to update.")return pd.DataFrame() new_df = pd.DataFrame(new_posts).sort_values("created_utc") all_new_comments = []print(f"🆕 Found {len(new_df)} new episodes. Downloading now...")for idx, row in new_df.iterrows(): post_id = row['post_id'] title_safe = row['title'].replace('/', '_').replace(':', '').replace('"', '') filename =f"{output_folder}/{len(existing_files)+idx:02d}_{post_id}_comments.parquet"for attempt inrange(1, max_retries +1):try:print(f"\n🔄 Processing: {title_safe} (Attempt {attempt})") submission = reddit.submission(id=post_id) submission.comments.replace_more(limit=None, threshold=5) comment_data = [{'comment': c.body,'score': c.score,'created_utc': c.created_utc,'author': str(c.author),'episode_post_id': post_id,'episode_title': row['title'] } for c in submission.comments.list()] df_episode = pd.DataFrame(comment_data) df_episode.to_parquet(filename, index=False) all_new_comments.append(df_episode)print(f"✅ Saved {len(comment_data)} comments for: {title_safe}") time.sleep(3)breakexceptExceptionas e:print(f"⚠️ Error on attempt {attempt} for {title_safe}: {e}")if"429"instr(e):print("🛑 Rate limited. Sleeping for 60 seconds...") time.sleep(60)else: time.sleep(10)if attempt == max_retries:print(f"❌ Failed all {max_retries} retries for {title_safe}. Skipping.")# Combine and return all new dataif all_new_comments: master_df = pd.concat(all_new_comments, ignore_index=True)print(f"\n📦 Returning DataFrame with {len(master_df)} new comments.")return master_dfelse:print("📭 No new comments were successfully downloaded.")return pd.DataFrame()
Sentiment Analysis
Once I had the data, it was finally time to dive into sentiment analysis. But it wasn’t as simple as just running a basic sentiment function on each comment. I also needed to add some Named Entity Recognition (NER) to make sure I was capturing sentiment directed at specific people. For example, if a comment criticizes Huda but praises Jeremiah, I wanted the model to recognize that and assign the right sentiment to each person, not just a single overall score.
After running the sentiment and NER steps, I reshaped the data so that each row represents one comment, one islander mentioned in that comment, and the sentiment expressed toward them.
Sentiment Analysis
import refrom transformers import AutoTokenizer, AutoModelForSequenceClassificationimport torchimport numpy as npimport nltknltk.download('punkt_tab')from nltk.tokenize import sent_tokenizeimport spacy# Initialize islander listislanders = ['Chelley','Olandria','Huda','Ace','Nic','Taylor','Jeremiah','Austin','Charlie','Cierra','Hannah','Amaya','Pepe','Jalen','Iris','Yulissa','Belle-A']# Call Hugginface Modeltokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")# Sentiment Functiondef get_sentiment_score(text): tokens = tokenizer(text, return_tensors='pt', truncation=True)with torch.no_grad(): output = model(**tokens) scores = torch.nn.functional.softmax(output.logits, dim=1)return {'negative': scores[0][0].item(),'neutral': scores[0][1].item(),'positive': scores[0][2].item(),'compound': scores[0][2].item() - scores[0][0].item() }# Targeted Sentiment Functionnlp = spacy.load("en_core_web_sm")def targeted_sentiment(comment, islanders): islander_sentiment = {} doc = nlp(comment)for sent in doc.sents: sentence_text = sent.text# Split each sentence into smaller chunks by contrastive conjunctions and commas raw_chunks = re.split(r'\bbut\b|\band\b|,', sentence_text, flags=re.IGNORECASE)for chunk in raw_chunks:iflen(chunk.strip()) ==0:continue chunk_doc = nlp(chunk) result = get_sentiment_score(chunk) chunk_lower = chunk.lower()for name in islanders:if name.lower() in chunk_lower:if name notin islander_sentiment: islander_sentiment[name] = [] islander_sentiment[name].append(result["compound"])return {name: np.mean(scores) for name, scores in islander_sentiment.items()}
For example, if we take the following comment:
“Huda was completely out of line during that argument, always playing the victim and stirring up drama. Meanwhile, Jeremiah stayed calm, listened respectfully, and brought everyone back down to earth.”
The model will output the following:
Test Function
comment ="Huda was completely out of line during that argument, always playing the victim and stirring up drama. Meanwhile, Jeremiah stayed calm, listened respectfully, and brought everyone back down to earth."print(targeted_sentiment(comment,islanders))
I could then apply the function to the enitire dataset and we are good to go!
Data Output
li_full.head()
comment
score
created_utc
episode_post_id
episode_title
islander
sentiment
episode_num
AirDate
0
you vote for ace and iris to couple up and he’...
746
1.749867e+09
1laxbqf
Season 7 - Episode 10 - Post Episode Discussion
Ace
-0.292615
10
2025-06-12
1
you vote for ace and iris to couple up and he’...
746
1.749867e+09
1laxbqf
Season 7 - Episode 10 - Post Episode Discussion
Iris
-0.292615
10
2025-06-12
2
you vote for ace and iris to couple up and he’...
746
1.749867e+09
1laxbqf
Season 7 - Episode 10 - Post Episode Discussion
Chelley
0.021599
10
2025-06-12
3
Huda is so insanely childish, she 100% gave Je...
674
1.749867e+09
1laxbqf
Season 7 - Episode 10 - Post Episode Discussion
Huda
-0.955151
10
2025-06-12
4
Huda is so insanely childish, she 100% gave Je...
674
1.749867e+09
1laxbqf
Season 7 - Episode 10 - Post Episode Discussion
Jeremiah
-0.288001
10
2025-06-12
Comment Summarization
Another goal for this project was to explore topic modeling and text summarization to give users a quick snapshot of what people are saying about a particular islander. I experimented with BERTopic and a few summarization models from Hugging Face, but I could never get anything that sounded truly coherent, it mostly just felt like a jumbled mashup of comments.
Failed Model
That’s when I decided to try out Google’s Gemini API, and it delivered exactly the kind of summaries I had in mind. All I had to do was initialize the API, pass in the comments for a specific islander, and it generated a clean and readable summary of what people were saying. It ended up being a perfect fit for this part of the app.
Gemini Example
Dashboard Development
As the project continues to evolve, I plan to spend more time improving the design and polish of the dashboard. For now, my focus was on getting a fully functioning proof of concept. The current version lets users select an islander to analyze, then displays a few basic metrics along with a line chart showing how the islander’s average sentiment has changed throughout the season. There’s also a button that triggers a summarization of viewer comments using the Gemini API, giving users a quick overview of what people are saying, as mentioned earlier.
Dashboard V.1
Key Challenges/Lessons Learned
This has been the most rewarding project I have ever worked on. It has pushed me to learn a great deal about Natural Language Processing and how to apply it in meaningful ways. Being able to extract insights from raw text and share them with users is incredibly valuable, and it is a skill I am excited to keep improving.
One of the biggest lessons I have learned from this project is the importance of organization and automation. In the past, many of my data science projects used static CSV files and manual processes that only worked once. For this project, I wanted to create something dynamic that could update and run independently. I learned how to structure my code so each function fits together cleanly, how to organize my files to make everything easier to maintain, and how to use GitHub workflows to automate updates. These steps helped me build a system that runs on its own and keeps everything up to date.
Comment Summarization
Another goal for this project was to explore topic modeling and text summarization to give users a quick snapshot of what people are saying about a particular islander. I experimented with BERTopic and a few summarization models from Hugging Face, but I could never get anything that sounded truly coherent, it mostly just felt like a jumbled mashup of comments.
That’s when I decided to try out Google’s Gemini API, and it delivered exactly the kind of summaries I had in mind. All I had to do was initialize the API, pass in the comments for a specific islander, and it generated a clean and readable summary of what people were saying. It ended up being a perfect fit for this part of the app.